310 research outputs found

    Beyond Zipf's Law: The Lavalette Rank Function and its Properties

    Although Zipf's law is widespread in natural and social data, one often encounters situations where one or both ends of the ranked data deviate from the power-law function. Previously we proposed the Beta rank function to improve the fitting of data that do not follow a perfect Zipf's law. Here we show that when the two parameters of the Beta rank function take the same value, which gives the Lavalette rank function, the probability density function can be derived analytically. We also show, both computationally and analytically, that the Lavalette distribution is approximately equal, though not identical, to the lognormal distribution. We illustrate the utility of the Lavalette rank function on several datasets. We also address three analysis issues: the statistical testing of the Lavalette fitting function, the comparison between Zipf's law and the lognormal distribution through the Lavalette function, and the comparison between the lognormal distribution and the Lavalette distribution. Comment: 15 pages, 4 figures
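
    The relation between the two rank functions can be illustrated with a minimal sketch. The abstract does not state the functional forms, so the code below assumes the commonly cited form of the Beta rank function, f(r) = C (N + 1 - r)^b / r^a for ranks r = 1, ..., N; the function names and parameter values are hypothetical.

```python
def beta_rank(r, n, c, a, b):
    """Beta rank function (assumed form): c * (n + 1 - r)**b / r**a."""
    return c * (n + 1 - r) ** b / r ** a


def lavalette_rank(r, n, c, a):
    """Lavalette rank function: the Beta rank function with b == a."""
    return c * ((n + 1 - r) / r) ** a


# Illustrative values only: 100 ranked items with exponent 0.8.
n = 100
values = [lavalette_rank(r, n, c=1.0, a=0.8) for r in range(1, n + 1)]
```

    With equal exponents the two ends of the ranked curve bend symmetrically on a log-log plot, which is the kind of deviation from a straight Zipf line described above.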

    Correcting for cryptic relatedness by a regression-based genomic control method

    Background: The genomic control (GC) method is a useful tool for correcting for cryptic relatedness in population-based association studies. It was originally proposed for correcting the variance inflation of the Cochran-Armitage additive trend test by using information from unlinked null markers, and was later generalized to other tests with the additional requirement that the null markers be matched with the candidate marker in allele frequencies. However, matching allele frequencies limits the number of available null markers and thus limits the applicability of the GC method. On the other hand, errors in genotype/allele frequencies may cause further bias and variance inflation and thereby aggravate the effect of GC correction. Results: In this paper, we propose a regression-based GC method using null markers that are not necessarily matched in allele frequencies with the candidate marker. Variation in the allele frequencies of the null markers is adjusted by a regression method. Conclusion: The proposed method can be readily applied to Cochran-Armitage trend tests other than the additive trend test, to Pearson's chi-square test and to other robust efficiency tests. Simulation results show that the proposed method is effective in controlling type I error in the presence of population substructure.

    Statistical significance for hierarchical clustering in genetic association and microarray expression studies

    BACKGROUND: With the increasing amount of data generated in molecular genetics laboratories, it is often difficult to make sense of results because of the vast number of different outcomes or variables studied. Examples include expression levels for large numbers of genes and haplotypes at large numbers of loci. It is then natural to group observations into smaller numbers of classes that allow for an easier overview and interpretation of the data. This grouping is often carried out in multiple steps with the aid of hierarchical cluster analysis, each step leading to a smaller number of classes by combining similar observations or classes. At each step, either implicitly or explicitly, researchers tend to interpret results and eventually focus on that set of classes providing the "best" (most significant) result. While this approach makes sense, the overall statistical significance of the experiment must include the clustering process, which modifies the grouping structure of the data and often removes variation. RESULTS: For hierarchically clustered data, we propose considering the strongest result or, equivalently, the smallest p-value as the experiment-wise statistic of interest and evaluating its significance level for a global assessment of statistical significance. We apply our approach to datasets from haplotype association and microarray expression studies where hierarchical clustering has been used. CONCLUSION: In all of the cases we examine, we find that relying on one set of classes in the course of clustering leads to significance levels that are too small when compared with the significance level associated with an overall statistic that incorporates the process of clustering. In other words, relying on one step of clustering may furnish a formally significant result while the overall experiment is not significant

    Partial correlation analysis indicates causal relationships between GC-content, exon density and recombination rate in the human genome

    Background: Several features are known to correlate with the GC-content in the human genome, including recombination rate, gene density and distance to telomere. However, by testing for pairwise correlation only, it is impossible to distinguish direct associations from indirect ones and to distinguish between causes and effects. Results: We use partial correlations to construct partially directed graphs for the following four variables: GC-content, recombination rate, exon density and distance-to-telomere. Recombination rate and exon density are unconditionally uncorrelated, but become inversely correlated by conditioning on GC-content. This pattern indicates a model where recombination rate and exon density are two independent causes of GC-content variation. Conclusions: Causal inference and graphical models are useful methods to understand genome evolution and the mechanisms of isochore evolution in the human genome.
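
    The pattern described in the Results (two uncorrelated variables that become inversely correlated after conditioning on a third) can be reproduced on synthetic data. The sketch below is not the authors' analysis: it uses the standard first-order partial correlation formula, and the variable names and simulated values are hypothetical stand-ins for the genomic features.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, conditioning on z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Synthetic data (assumption): two independent causes feed a common effect.
rng = random.Random(0)
n = 5000
recomb = [rng.gauss(0, 1) for _ in range(n)]
exon = [rng.gauss(0, 1) for _ in range(n)]
gc = [a + b + rng.gauss(0, 0.5) for a, b in zip(recomb, exon)]

marginal = pearson(recomb, exon)              # close to zero
conditional = partial_corr(recomb, exon, gc)  # clearly negative
```

    In this toy model the marginal correlation is near zero while the partial correlation given the common effect is strongly negative, matching the signature of two independent causes.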

    Effective Sample Size: Quick Estimation of the Effect of Related Samples in Genetic Case-Control Association Analyses

    Correlated samples have frequently been avoided in case-control genetic association studies, in part because the methods for handling them are either not easily implemented or not widely known. We advocate one method for case-control association analysis of correlated samples, the effective sample size method, as a simple and accessible approach that does not require specialized computer programs. The effective sample size method captures the variance inflation of allele frequency estimation exactly, and can be used to modify the chi-square test statistic, p-value, and 95% confidence interval of the odds ratio simply by replacing the apparent allele counts with the effective ones. For genotype frequency estimation, although a single effective sample size is unable to completely characterize the variance inflation, an averaged one can satisfactorily approximate the simulated result. The effective sample size method is applied to the rheumatoid arthritis sibling data collected by the North American Rheumatoid Arthritis Consortium (NARAC) to establish a significant association with the interferon-induced helicase 1 gene (IFIH1), previously identified as a type 1 diabetes susceptibility locus. Connections between the effective sample size method and other methods, such as generalized estimating equations, variance of eigenvalues for correlation matrices, and genomic control, are also discussed.
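
    The idea of replacing apparent allele counts with effective ones can be sketched as follows. The abstract does not give the paper's formula, so this example assumes the textbook definition based on the variance of a mean of correlated observations, n_eff = n^2 / (sum of all entries of the correlation matrix); the correlation values, allele counts and function names are hypothetical.

```python
def effective_sample_size(corr):
    """n_eff = n**2 / sum(corr), from the variance of a correlated mean."""
    n = len(corr)
    total = sum(sum(row) for row in corr)
    return n * n / total


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical example: four case alleles forming two pairs with correlation 0.5.
corr = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
shrink = effective_sample_size(corr) / len(corr)  # effective / apparent

# Apparent allele counts (risk, other) for cases and independent controls.
case_risk, case_other = 60, 40
ctrl_risk, ctrl_other = 45, 55

chi2_apparent = chi_square_2x2(case_risk, case_other, ctrl_risk, ctrl_other)
chi2_effective = chi_square_2x2(
    case_risk * shrink, case_other * shrink, ctrl_risk, ctrl_other
)
```

    Scaling the case counts leaves the estimated allele frequency unchanged but reduces the test statistic, which reflects the smaller amount of independent information in related samples.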

    Likelihood ratio tests in random graph models with increasing dimensions

    We explore the Wilks phenomena in two random graph models: the $\beta$-model and the Bradley-Terry model. For two increasing dimensional null hypotheses, including a specified null $H_0: \beta_i=\beta_i^0$ for $i=1,\ldots, r$ and a homogenous null $H_0: \beta_1=\cdots=\beta_r$, we reveal high dimensional Wilks' phenomena that the normalized log-likelihood ratio statistic, $[2\{\ell(\widehat{\mathbf{\beta}}) - \ell(\widehat{\mathbf{\beta}}^0)\} -r]/(2r)^{1/2}$, converges in distribution to the standard normal distribution as $r$ goes to infinity. Here, $\ell( \mathbf{\beta})$ is the log-likelihood function on the model parameter $\mathbf{\beta}=(\beta_1, \ldots, \beta_n)^\top$, $\widehat{\mathbf{\beta}}$ is its maximum likelihood estimator (MLE) under the full parameter space, and $\widehat{\mathbf{\beta}}^0$ is the restricted MLE under the null parameter space. For the homogenous null with a fixed $r$, we establish Wilks-type theorems that $2\{\ell(\widehat{\mathbf{\beta}}) - \ell(\widehat{\mathbf{\beta}}^0)\}$ converges in distribution to a chi-square distribution with $r-1$ degrees of freedom, as the total number of parameters, $n$, goes to infinity. When testing the fixed dimensional specified null, we find that its asymptotic null distribution is a chi-square distribution in the $\beta$-model. However, unexpectedly, this is not true in the Bradley-Terry model. By developing several novel technical methods for asymptotic expansion, we explore Wilks type results in a principled manner; these principled methods should be applicable to a class of random graph models beyond the $\beta$-model and the Bradley-Terry model. Simulation studies and real network data applications further demonstrate the theoretical results. Comment: This paper supersedes arxiv article arXiv:2211.10055 titled "Wilks' theorems in the $\beta$-model" by T. Yan, Y. Zhang, J. Xu, Y. Yang and J. Zh

    The spatiotemporal response of soil moisture to precipitation and temperature changes in an arid region, China

    Soil moisture plays a crucial role in the hydrological cycle and climate system. The reliable estimation of soil moisture in space and time is important to monitor and even predict hydrological and meteorological disasters. Here we studied the spatiotemporal variations of soil moisture and explored the effects of precipitation and temperature on soil moisture in different land cover types within the Tarim River Basin from 2001 to 2015, based on high-spatial-resolution soil moisture data downscaled from the European Space Agency's (ESA) Climate Change Initiative (CCI) soil moisture data. The results show that the spatial average soil moisture increased slightly from 2001 to 2015, and that the soil moisture variation in summer contributed most to regional soil moisture change. Among land cover types, the highest soil moisture occurred in forest and the lowest in bare land, and soil moisture showed significant increasing trends in grassland and bare land during 2001-2015. Both partial correlation analysis and multiple linear regression analysis demonstrate that, in the study area, precipitation had positive effects on soil moisture while temperature had negative effects, and that precipitation made greater contributions to soil moisture variations than temperature. The results of this study can be used for decision making for water management and allocation.